Chapter 1: About the project

The project started on 19 January 2017. Updates will be available here: https://github.com/tlehtiniemi/IODS-project


Chapter 2: Regression and model validation

Let’s first read data from the local file into a data frame

learning2014 <- read.table("/Users/lehtint9/IODS-project/data/learning2014.txt")

Next let’s have a look at the dimensions and structure of the data set

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.75 2.88 3.88 3.5 3.75 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data frame has 166 observations of 7 variables. The variables are gender (female or male), age, four attributes (attitude, deep, stra, surf) and the total points.

To examine the variables, let us print a summary of them. This shows overview statistics of all variables.

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.625   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.500   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.875   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.796   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.250   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.875   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

We will need the following additional libraries to analyse the data graphically:

library(GGally)
library(ggplot2)

To explore the variables graphically and specifically the relationships between them, we will draw an overview plot matrix with ggpairs

p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

The gender distribution is clearly skewed, as is the age distribution (as expected, given what I assume are university students). age does not significantly correlate with any of the other variables.

The three highest linear correlations (in absolute value) with the points variable are attitude (0.437), stra (0.146) and surf (-0.144); the correlations of points with stra and surf are rather low, however. surf seems to be correlated to some extent with attitude (-0.176) and stra (-0.161). The correlation between stra and attitude is low (0.0617).
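As a sanity check, the correlations quoted above can be computed directly from the data (a sketch, assuming learning2014 has been read in as at the start of this chapter):

```r
# Pairwise correlations between the numeric variables discussed above;
# assumes learning2014 is loaded as at the start of this chapter
round(cor(learning2014[, c("attitude", "stra", "surf", "points")]), 3)
```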

Based on the observations on correlations, let’s regress points on attitude, stra and surf and print out a summary of the model

my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The statistically significant coefficients are the intercept (at 0.01 level) and attitude (at 0.001 level).

As stra and surf were not statistically significant explanatory variables, let’s remove them from the model

my_model2 <- lm(points ~ attitude, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

This had a small impact on the remaining coefficients and diminished the R-squared by a small amount (as expected) in comparison to the previous model. Now the intercept and attitude coefficients are both significant at the 0.001 level. As the intercept coefficient is 11.6, a person with an attitude score of zero would get 11.6 points according to the model. Each unit increase in attitude then increases the points by about 0.35. As per the multiple R-squared diagnostic, this model explains about 19% of the variability of the points variable around its mean.
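The prediction logic described above can be sketched with the fitted coefficients (11.637 and 0.353, taken from the model summary; the predicted_points helper is purely illustrative — predict() on the model object would do the same):

```r
# Sketch: predicted exam points from the fitted equation
# points = 11.637 + 0.353 * attitude (coefficients from the summary above)
predicted_points <- function(attitude) 11.637 + 0.353 * attitude
predicted_points(30)  # prediction for a student with attitude score 30
```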

Let’s plot this model to aid in diagnostics.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

Now on to diagnostics of the model.

par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))

Residuals seem to be somewhat correlated with model predictions. Visually, the residuals seem smaller the higher the model predictions are, apart from a few outliers. This would indicate that the constant variance of errors assumption does not hold. On visual inspection, though, the effect is not very severe.

Based on the QQ plot, the errors are not exactly normally distributed. Although the residuals follow the theoretical quantiles closely in the low and mid range (in absolute terms), they deviate from the line at the high end of the error distribution. The deviations are, however, comparatively small.

On the residuals vs leverage plot, we see that there are some observations with comparatively high leverage. These are likely the ones with very low attitude and very low points, as they would “pull” the regression line downwards from the low end. However, the leverage values are not high in absolute terms for any observation. Therefore we should probably not be too worried about outliers.
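To back up this visual impression, the leverage values can also be inspected numerically (a sketch, assuming my_model2 is fitted as above; for a simple regression the average leverage is 2/n):

```r
# Largest leverage values of my_model2 (fitted above)
h <- hatvalues(my_model2)
head(sort(h, decreasing = TRUE))  # the most influential observations
2 / length(h)                     # average leverage, for comparison
```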


Chapter 3: Logistic regression

Again let’s first load libraries and read data from the local file into a data frame

library(dplyr); library(ggplot2); library(boot)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc <- read.table("/Users/lehtint9/IODS-project/data/Ch3_alc.csv", sep=";", header=TRUE)

Load and explore data

The data contains responses to a survey of students of two classes. It has demographic and background variables, and information about free time, studies, and alcohol consumption.

It has 35 variables for 382 students.

Variable names are

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Exploration of variables to study

In this exercise we are interested in the alcohol consumption of students. We’ll start with the hypothesis that four variables are related to alcohol consumption. It seems reasonable that the following variables could have a connection with alcohol consumption:

  1. Gender of the student. Hypothesis: males consume more alcohol. sex is the student’s sex (binary: ‘F’ - female or ‘M’ - male) (factor variable)
  2. Romantic relationship. Hypothesis: a romantic relationship is associated with less alcohol consumption. romantic tells if there is a romantic relationship (binary: yes or no) (factor variable)
  3. The time spent in studies. Hypothesis: the more time spent in studies, the less alcohol consumption. The variable studytime is the weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  4. Quality of family relations. Hypothesis: better quality, less alcohol. famrel is the quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
g1a <- ggplot(data = alc, aes(x=high_use, col=sex))
g1b <- g1a+geom_bar()+facet_wrap("sex")
g1b

There seem to be proportionally more high alcohol users among males than among females. Let’s look at this in cross-tabulation too. High use is on rows, gender on columns. There are proportionally more male high users.

mytable <- table(alc$high_use, alc$sex)
mytable # print table 
##        
##           F   M
##   FALSE 156 112
##   TRUE   42  72
prop.table(mytable, 2) # column percentages
##        
##                 F         M
##   FALSE 0.7878788 0.6086957
##   TRUE  0.2121212 0.3913043

Let’s look at all the other variables in relation to both high_use and sex.

First romantic relations.

mytable <- table(alc$high_use, alc$romantic)
mytable # print table 
##        
##          no yes
##   FALSE 180  88
##   TRUE   81  33
prop.table(mytable, 2) # column percentages
##        
##                no       yes
##   FALSE 0.6896552 0.7272727
##   TRUE  0.3103448 0.2727273

A higher percentage of those who are not in romantic relations are high alcohol users. The effect seems small, though; we’ll know more later. Let’s visualize this too.

g2a <- ggplot(data = alc, aes(x=high_use, col=sex))
g2b <- g2a+geom_bar()+facet_wrap("romantic")+ggtitle("Romantic relation, high alcohol use and gender")
g2b

Next, studytime

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_studytime
##   <fctr>    <lgl> <int>          <dbl>
## 1      F    FALSE   156       2.339744
## 2      F     TRUE    42       2.000000
## 3      M    FALSE   112       1.883929
## 4      M     TRUE    72       1.638889

For both males and females, average study times are higher among non-high-users. The average study times of males are lower overall than those of females.

Then family relations or famrel.

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_famrel = mean(famrel))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_famrel
##   <fctr>    <lgl> <int>       <dbl>
## 1      F    FALSE   156    3.910256
## 2      F     TRUE    42    3.761905
## 3      M    FALSE   112    4.133929
## 4      M     TRUE    72    3.791667

For both males and females, family relations are on average better among non-high-users. For males, the difference in mean family relations between high users and non-high users is larger.

Logistic regressions of chosen variables

Let’s fit a logistic model to the high_use data. We are fitting the model to find what factors are correlated with higher alcohol consumption among students.

m <- glm(high_use ~ sex + romantic + studytime + famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + romantic + studytime + famrel, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4916  -0.8515  -0.6620   1.2041   2.1353  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.8894     0.6030   1.475  0.14022   
## sexM          0.7199     0.2448   2.941  0.00327 **
## romanticyes  -0.1301     0.2551  -0.510  0.61018   
## studytime    -0.4631     0.1562  -2.965  0.00302 **
## famrel       -0.3022     0.1262  -2.395  0.01662 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 434.86  on 377  degrees of freedom
## AIC: 444.86
## 
## Number of Fisher Scoring iterations: 4

The model summary shows that two of the variables are significant at the 0.01 level: gender and studytime. Family relations are significant at the 0.05 level. romantic relations did not have a statistically significant effect. Being male had a positive correlation with high alcohol use, while studytime and famrel had negative correlations; more studying and better family relations were correlated with less alcohol consumption. The directions of the statistically significant (0.01 or 0.05 level) effects were as hypothesized initially.

Below, the results of fitting the model are presented as odds ratios. This means that a unit change in the explanatory variable (vs. no change) multiplies the odds of high alcohol use by the corresponding odds ratio. Confidence intervals are also printed out.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 2.4336272 0.7466719 8.0045769
## sexM        2.0541668 1.2749343 3.3342592
## romanticyes 0.8780501 0.5285733 1.4404187
## studytime   0.6293367 0.4591485 0.8484560
## famrel      0.7391943 0.5763518 0.9468467

According to the model, males have an odds ratio of about 2, meaning that the odds of a male being a high consumer of alcohol are about twice those of a female.

romantic relations were not a statistically significant explanatory variable. As an additional note, in the odds ratios we also see that the 95% confidence interval spans from 0.53 to 1.44, i.e. it includes 1, which indicates an ambiguous relationship to alcohol consumption.

A unit increase in the studytime variable multiplies the odds of high alcohol use by about 0.63; a unit increase in family relations (famrel) multiplies them by about 0.74. Both reduce the odds of high_use, but practical interpretation of the effect is difficult due to the categorical nature of these variables. In particular, studytime is not in hours but is categorised into non-uniform intervals of hours, so this result should be interpreted in terms of jumps from one category to another.
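To make the interpretation concrete, the log-odds for an example student can be turned into a probability with plogis(). This is a sketch using the coefficients printed in the model summary above; the example student (male, no romantic relationship, studytime 2, famrel 4) is hypothetical:

```r
# Log-odds for the hypothetical student, using the summary coefficients:
# intercept 0.8894, sexM 0.7199, romanticyes -0.1301,
# studytime -0.4631, famrel -0.3022
log_odds <- 0.8894 + 0.7199 * 1 - 0.1301 * 0 - 0.4631 * 2 - 0.3022 * 4
plogis(log_odds)  # predicted probability of high alcohol use
```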

Let’s now refit the model, leaving out romantic relations - as important as they are in real life, they are not statistically significant in this case.

m2 <- glm(high_use ~ sex + studytime + famrel, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + studytime + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.530  -0.854  -0.666   1.218   2.157  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.8395     0.5955   1.410  0.15861   
## sexM          0.7257     0.2444   2.970  0.00298 **
## studytime    -0.4679     0.1563  -2.994  0.00275 **
## famrel       -0.2982     0.1259  -2.368  0.01786 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 435.12  on 378  degrees of freedom
## AIC: 443.12
## 
## Number of Fisher Scoring iterations: 4

The summary statistics look as expected. famrel is significant only at the 0.05 level, but let’s keep it in the model anyway.

Predictive power of the model

Now we’ll predict high use using this new model. We know the actual values of high_use, so we can compare those to the predictions. Ideally there’s nothing in the off-diagonal cells of the confusion matrix:

model_prob <- predict(m2, type = "response")
alc <- mutate(alc, high_use_prob = model_prob)
alc <- mutate(alc, high_use_pred = high_use_prob>0.5)
table(high_use = alc$high_use, prediction = alc$high_use_pred)
##         prediction
## high_use FALSE TRUE
##    FALSE   255   13
##    TRUE    102   12

Quite a few false negatives in the predictions. Let’s visualize the results a bit more.

g <- ggplot(alc, aes(x = high_use_prob, y = high_use, col=high_use_pred))
g+geom_point()

table(high_use = alc$high_use, high_use_pred = alc$high_use_pred) %>% prop.table() %>% addmargins()
##         high_use_pred
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66753927 0.03403141 0.70157068
##    TRUE  0.26701571 0.03141361 0.29842932
##    Sum   0.93455497 0.06544503 1.00000000

which basically tells the same story. The model predicts too many negatives. High training error at about 0.30. Not a good prediction! Very bad. Sad!
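The training error can be read directly off the confusion matrix above: it is simply the share of misclassified observations (false positives plus false negatives over all 382 students).

```r
# Training error from the confusion matrix above:
# 13 false positives + 102 false negatives out of 382 observations
train_err <- (13 + 102) / (255 + 13 + 102 + 12)
train_err
```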

To quantify this a bit, we’ll define a loss function and use it to compare predictions to simple guesswork.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

Now the guesswork: let’s guess that there are no high users of alcohol. The loss function is

loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293

And then let’s use the actual predictions. Now the loss function is

loss_func(class = alc$high_use, prob = alc$high_use_prob)
## [1] 0.3010471

Using this measure, the guesswork is actually a little bit better! So the model was actually really sad.

Cross validation and a better model

Let’s perform 10-fold cross validation on the model and print out the loss function for it.

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
## [1] 0.3115183

So this model performed worse than the model in the datacamp exercise: loss function about 0.31 versus about 0.26 in datacamp.

What would be a better model? One way to approach this would be to start from the model in datacamp (predictors were failures, absences, and sex) and add a predictor that increases the predictive power of the model. In this exercise, the “best” variable used as a predictor was studytime. Let’s add that to the model, perform the cross validation and print out the loss function.

m3 <- glm(high_use ~ sex + failures + absences + studytime, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.2670157

And voilà, the new model achieves a loss of about 0.27, roughly on par with the datacamp exercise (.26).


Chapter 4: Clustering and classification

The Boston data

In this Chapter, we will be using a dataset on suburbs of Boston. We are aiming to employ classification and clustering on the dataset. In classification, we know the classes, and aim to fit a model that can classify observations into these classes based on other variables. We will classify the suburbs according to their crime rate. In clustering, we are trying to find similar groups or clusters of observations without prior knowledge of the clusters.

Let’s first load the libraries needed in this Chapter.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(corrplot)
library(NbClust)

The dataset is included in the MASS package. Let’s load the Boston dataset and explore its contents.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The dataset has 14 variables and 506 observations. The variables describe characteristics of 506 suburbs of Boston, including the crime rate crim that we will use for classification, and 13 other variables.

We are most interested in the crim variable that describes the per capita crime rate of the area. The crime rate is a continuous variable that varies highly between areas: the max crime rate is really high compared to the median crime rate. Looking at the quartiles, we can see that this is partly due to outliers, as the 3rd quartile is much smaller than the max value. We note that the 3rd quartile is nonetheless already very high compared to the minimum value.

Also some of the other variables have uneven distributions: e.g. black (scaled proportion of blacks), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940) and lstat (lower status of the population (percent)).

To explore the relations between the variables of the dataset, let’s print out pairwise scatter plots and a correlation plot.

pairs(Boston)

cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix)

By visual inspection, the crime rate is correlated with many of the variables. Since the crime rate (and other variables) are highly variable, it is not easy to interpret the correlations from the scatter plots. However, from the correlation plot, the crime rate is negatively correlated with e.g. housing values and distances to employment centers, and positively correlated with e.g. access to radial highways (rad) and property tax rate (tax). High correlations among variables other than crim are also found.

Scaling the dataset and categorising crime rate

LDA assumes that the variables are normally distributed and have the same variance. Therefore, we’ll next standardize the dataset by subtracting the mean of each variable from each observation and dividing by the standard deviation of the variable.
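This centering and rescaling is exactly what the scale() function does; a minimal check on a toy vector (the vector is made up for illustration):

```r
# scale() centers by the mean and divides by the standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE: scale() does the same
```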

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The above summary of the scaled dataset shows that each variable now has a mean of 0, that is, each distribution is centered around zero.

In order to make use of the crime rate in classification, it needs to be a categorical variable. Let’s replace the crim variable with a categorical version of the crime rate, the crime variable, by categorising crim into its quartiles. The crime categories are labeled low, med_low, med_high and high.

scaled_crim <- boston_scaled$crim
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Fitting the LDA model

We’ll use an LDA model to classify the suburbs into crime rate classes. First we divide the dataset into a training dataset and a test dataset. We will fit the classifier on the training dataset, and then employ the test dataset to see how well the classification performs on new data. We do this by picking a random 80% of the observations for the training dataset and leaving the rest as the test dataset. The test dataset does not include the crime classes; these are stored separately.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Now we’ll use LDA to fit a model that classifies the training dataset into the crime classes. We make use of the lda.arrows function as presented in the datacamp exercise in order to print out a biplot, that is, a scatter plot of the classes along the first two linear discriminants, with arrows showing the influence of each variable.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Predicting with the LDA model

Now we’ll predict the crime rates of the test dataset based on the model fitted on the training dataset.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      16        4    0
##   med_low    0      12        8    0
##   med_high   1       9       15    1
##   high       0       0        0   24

Most of the predictions are on the diagonal of the cross-tabulation. The prediction error was about 38% (39 of the 102 test observations were misclassified). This is probably an ok result.
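The share of correct predictions can be checked directly from the cross-tabulation above (a sketch with the table values typed in by hand):

```r
# Confusion matrix from above (rows: correct class, cols: predicted)
confusion <- matrix(c(12, 16,  4,  0,
                       0, 12,  8,  0,
                       1,  9, 15,  1,
                       0,  0,  0, 24), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy        # share of correct predictions
1 - accuracy    # prediction error
```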

K-means clustering

With K-means clustering, we are aiming to find clusters of similar observations in the data, without prior knowledge of these clusters. For clustering, we’ll use the scaled Boston dataset, so that the distances are comparable. The following prints out summaries of the Euclidean and Manhattan distance matrices of the scaled Boston dataset. Euclidean is the geometric (straight-line) distance; Manhattan is the distance measured along the axes.

set.seed(123)
boston_clusters <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_clusters)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
dist_man <- dist(boston_clusters, method="manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600

Not surprisingly, Manhattan distances are higher.
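A tiny worked example makes the difference concrete: for the points (0, 0) and (3, 4), the Euclidean distance is sqrt(3^2 + 4^2) = 5, while the Manhattan distance is 3 + 4 = 7.

```r
# Euclidean vs Manhattan distance between (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: 5
dist(pts, method = "manhattan")  # Manhattan: 7
```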

Let’s perform the K-means clustering with Euclidian distances and K=3 clusters

km <-kmeans(dist_eu, centers = 3)
pairs(Boston, col = km$cluster)

For the purpose of finding optimal number of clusters, we’ll explore the total within cluster sum of squares (twcss) with the number of clusters ranging from 1 to 10.

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

The interpretation of the twcss curve is that the optimal number of clusters is where the twcss drops radically. Here, it obviously drops radically when K changes from 1 to 2. The change from 2 to 3 might also be interpreted as a radical drop (the exact values depend on the locations of the initial random cluster centers, but this drop was about 25% of twcss with the random seed set by set.seed(123) as above). We could stick to K=2 as the optimal, but this is debatable.
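The size of each drop can also be quantified rather than eyeballed (a sketch, assuming the twcss vector computed above):

```r
# Relative drop in twcss at each increase of K; a large relative
# drop suggests that adding the cluster was worthwhile
drops <- -diff(twcss) / head(twcss, -1)
round(drops, 2)
```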

So how to determine the number of clusters? Let’s run a set of tests to see if we can find some consensus. Used in this way, NbClust performs a number of tests and reports the best number of clusters based on majority rule.

nb_clusters <- NbClust(data = data.matrix(boston_scaled), distance = "euclidean", min.nc = 2, max.nc = 15, method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 
## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 3 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 3 proposed 11 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

So let’s use K=2 as the optimal number of clusters.

km <- kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)

With K=2 clusters, the pairwise scatter plots show that some of the variables are clearly divided between the clusters (in the sense of these pairwise plots), while others are not.

The crime rate, for example, is clearly divided: one of the clusters includes only low crime rates, while the other includes both low and high crime rates. Some of the other variables are also clearly dichotomous when plotted against others; for example, rad and tax show such a tendency.


Chapter 5: Dimensionality reduction

The data

For this exercise we will use the “human” dataset from the UNDP programme. The dataset has countries as row names and includes 155 observations of 8 variables.

human <- read.csv("/Users/lehtint9/IODS-project/data/human.csv",row.names = 1, sep=",", header=TRUE)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The eight variables describe the health, knowledge and empowerment conditions in the countries.

  • Edu2.FM Proportion of females with at least secondary education divided by proportion of males with at least secondary education
  • Labo.FM Proportion of females in the labour force divided by proportion of males in the labour force
  • Edu.Exp Expected years of schooling
  • Life.Exp Life expectancy at birth
  • GNI Gross National Income per capita
  • Mat.Mor Maternal mortality ratio
  • Ado.Birth Adolescent birth rate
  • Parli.F Percentage of female representatives in parliament

Visualize and explore data

library(GGally)
library(ggplot2)
library(dplyr)
library(corrplot)
ggpairs(human)

cor(human) %>% corrplot.mixed()

From the plots, it is clear that there are rather strong correlations between six of the variables in the dataset. The strongest positive correlations are found between expected years of schooling and life expectancy (Edu.Exp and Life.Exp) as well as between maternal mortality and adolescent birth rate (Mat.Mor and Ado.Birth). The strongest negative correlations are found between these two pairs, i.e. between the education and health variables on the one hand and the mortality-related variables on the other. Interestingly, two variables (Labo.FM and Parli.F) correlate only weakly with all of the other variables.

PCA of the dataset

PCA transforms the dataset into a new space. The dimensions of this new space are called the principal components of the data. The first principal component captures the maximum amount of variance from the features of the original data. Each successive principal component is orthogonal to the first (and all previous) components, and captures the maximum amount of the remaining variance.
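The two properties above (orthogonality and decreasing captured variance) can be checked directly with prcomp() on a small random dataset; this illustration is not part of the original analysis:

```r
set.seed(1)
# 50 observations of 4 features, with columns 1 and 2 strongly correlated
x <- matrix(rnorm(200), ncol = 4)
x[, 2] <- x[, 1] + 0.1 * x[, 2]
pca <- prcomp(x)
# The rotation (loadings) matrix is orthonormal, so the principal
# components are orthogonal to each other: crossprod gives ~ identity
round(crossprod(pca$rotation), 10)
# The standard deviations of the components are in decreasing order,
# i.e. each component captures less variance than the previous one
pca$sdev
```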

First we use the dataset as is and perform principal component analysis of it.

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

As we already see from the model summary, the first principal component PC1 captures almost all of the variance in the data by itself. Let’s plot the biplot to inspect the model visually.

biplot(pca_human, choices = 1:2, cex=c(0.7, 1), col = c("grey40", "deeppink2"))

We see from the biplot that the first principal component, PC1, is practically identical to the GNI feature. The GNI arrow is much longer than the other arrows, reflecting its standard deviation: the variance of the GNI variable is really high compared to the other variables. Therefore, PCA ends up just explaining the variance of the data by the variance of GNI.

PCA of the standardised dataset

Since PCA on the original data did not provide very good results, we’ll standardise the dataset so that the variances of all variables are equal.
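scale() subtracts each column’s mean and divides by its standard deviation, so every variable ends up with mean 0 and variance 1. A quick check on illustrative data (not the actual human data):

```r
set.seed(2)
# Two columns with wildly different variances, like GNI vs. the others
m <- cbind(small = rnorm(50, sd = 1), huge = rnorm(50, sd = 1000))
m_std <- scale(m)
# After standardisation both columns have mean ~0 ...
round(colMeans(m_std), 10)
# ... and standard deviation exactly 1, so neither dominates the PCA
apply(m_std, 2, sd)
```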

human_std <- scale(human)
pca_human_std <- prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex=c(0.7, 1), col = c("grey40", "deeppink2"))

These PCA results are clearly different from the results of PCA on the non-standardised dataset: GNI does not dominate the variance anymore, and the results make much more “sense”.

Interpreting the results of PCA from the biplot, we see that

  • Low values of PC1 describe countries with high expected years of schooling, high life expectancy, high female education compared to male education, and high GNI. Simultaneously, high values of PC1 describe countries with high maternal mortality ratio and high adolescent birth rate. The PC1 feature, then, could be interpreted to describe “inverse of standard of living”.

  • High values of PC2 describe countries with a high proportion of females in the labor force compared to males, and a high percentage of females in parliament. The PC2 feature, then, could be interpreted to describe “gender equality in society”.

These interpretations of PC1 and PC2 are, of course, rough descriptions of much more nuanced societal phenomena.

Tea data

For the remaining part of the exercise we’ll use the tea dataset from FactoMineR.

library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The tea data has 300 observations of 36 variables. We’ll pick some variables that seem interesting (places to drink tea, gender, and age groups) and visualise only these variables.

library(tidyr)
keep <- c("home", "work", "tearoom", "resto", "pub", "age_Q", "sex")
tea_ <- dplyr::select(tea, one_of(keep))

gather(tea_) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

We’ll also perform MCA for these variables.

tea_mca <- MCA(tea_, graph = FALSE)
summary(tea_mca)
## 
## Call:
## MCA(X = tea_, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.223   0.198   0.168   0.159   0.151   0.128
## % of var.             15.607  13.877  11.736  11.152  10.600   8.933
## Cumulative % of var.  15.607  29.484  41.220  52.372  62.971  71.905
##                        Dim.7   Dim.8   Dim.9  Dim.10
## Variance               0.115   0.103   0.097   0.087
## % of var.              8.028   7.198   6.769   6.100
## Cumulative % of var.  79.933  87.131  93.900 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           | -0.403  0.243  0.123 |  0.128  0.028  0.012 | -0.319  0.203
## 2           | -0.219  0.071  0.057 | -0.250  0.105  0.074 | -0.127  0.032
## 3           |  0.509  0.388  0.175 |  0.262  0.116  0.046 | -0.145  0.042
## 4           | -0.545  0.443  0.413 | -0.172  0.050  0.041 | -0.167  0.055
## 5           | -0.498  0.370  0.259 |  0.094  0.015  0.009 | -0.049  0.005
## 6           | -0.545  0.443  0.413 | -0.172  0.050  0.041 | -0.167  0.055
## 7           | -0.403  0.243  0.123 |  0.128  0.028  0.012 | -0.319  0.203
## 8           | -0.124  0.023  0.013 | -0.216  0.079  0.039 | -0.398  0.315
## 9           | -0.403  0.243  0.123 |  0.128  0.028  0.012 | -0.319  0.203
## 10          |  0.081  0.010  0.003 |  0.066  0.007  0.002 | -0.027  0.001
##               cos2  
## 1            0.077 |
## 2            0.019 |
## 3            0.014 |
## 4            0.039 |
## 5            0.003 |
## 6            0.039 |
## 7            0.077 |
## 8            0.131 |
## 9            0.077 |
## 10           0.000 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## home        |  -0.006   0.002   0.001  -0.604 |  -0.072   0.360   0.167
## Not.home    |   0.199   0.076   0.001   0.604 |   2.321  11.649   0.167
## Not.work    |  -0.340   5.249   0.282  -9.190 |  -0.260   3.454   0.165
## work        |   0.832  12.851   0.282   9.190 |   0.636   8.455   0.165
## Not.tearoom |  -0.309   4.941   0.399 -10.920 |   0.037   0.081   0.006
## tearoom     |   1.290  20.615   0.399  10.920 |  -0.156   0.338   0.006
## Not.resto   |  -0.325   4.993   0.296  -9.406 |  -0.185   1.813   0.096
## resto       |   0.910  13.968   0.296   9.406 |   0.517   5.072   0.096
## Not.pub     |  -0.283   4.058   0.302  -9.495 |   0.106   0.644   0.043
## pub         |   1.065  15.264   0.302   9.495 |  -0.400   2.423   0.043
##              v.test     Dim.3     ctr    cos2  v.test  
## home         -7.059 |  -0.039   0.126   0.049  -3.847 |
## Not.home      7.059 |   1.265   4.090   0.049   3.847 |
## Not.work     -7.029 |   0.280   4.739   0.192   7.572 |
## work          7.029 |  -0.685  11.602   0.192  -7.572 |
## Not.tearoom   1.318 |  -0.162   1.805   0.110  -5.724 |
## tearoom      -1.318 |   0.676   7.532   0.110   5.724 |
## Not.resto    -5.345 |  -0.241   3.647   0.163  -6.972 |
## resto         5.345 |   0.674  10.203   0.163   6.972 |
## Not.pub       3.567 |  -0.034   0.076   0.004  -1.128 |
## pub          -3.567 |   0.126   0.286   0.004   1.128 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## home        | 0.001 0.167 0.049 |
## work        | 0.282 0.165 0.192 |
## tearoom     | 0.399 0.006 0.110 |
## resto       | 0.296 0.096 0.163 |
## pub         | 0.302 0.043 0.004 |
## age_Q       | 0.076 0.633 0.644 |
## sex         | 0.205 0.278 0.012 |
plot(tea_mca, invisible=c("ind"), habillage = "quali")

The first two dimensions of MCA explain about 30% of variance.
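This ~30% figure can be recomputed from the eigenvalues (the variances of the MCA dimensions) printed in the summary above:

```r
# Eigenvalues of the MCA dimensions, copied from the summary output
ev <- c(0.223, 0.198, 0.168, 0.159, 0.151, 0.128, 0.115, 0.103, 0.097, 0.087)
# Cumulative percentage of variance explained by the first k dimensions
cum_pct <- 100 * cumsum(ev) / sum(ev)
round(cum_pct[2], 1)   # first two dimensions: ~29.5, close to the 29.48 above
```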

The biplot for variables allows for some interpretation of tea drinking habits. For example, we see that

  • Drinking tea in pubs, restaurants and tearooms is related: the same people seem to drink tea in all of them.
  • The youngest (15-24 year olds) and the oldest (60+ year olds) tea drinkers have similar tea drinking habits.
  • Young adults seem not to drink tea at home, though otherwise their drinking habits are not explained by any single category.
  • Workers drink tea in restaurants.
  • 45-59 year olds drink tea particularly at home.
plot(tea_mca, invisible=c("var"), habillage = "quali")

The biplot for individuals is mostly hard to interpret. We mainly see that there are a few outlier individuals, which might indicate that the MCA result depends to some extent on these outliers.